Given that the p53 family of proteins (p53, p63, and p73) serves overlapping functions in normal development and in regulating the expression of genes that control apoptosis in humans, how do its members compare in amino acid sequence identity and 3D protein structure?
The p53 gene is the tumor suppressor gene most frequently mutated in human cancers. Its product, the p53 protein, is a transcription factor that regulates the expression of genes controlling apoptosis and cell cycle arrest in response to genotoxic and cellular stress. Together with its two close homologs, p63 and p73, it makes up the p53 family of proteins. Although p63 and p73 can transactivate p53 target genes and exert p53-like functions such as DNA-damage-induced apoptosis and cell cycle arrest, they do not function as classical Knudson-type tumor suppressors and are rarely mutated in human cancers (DeYoung 2007). Furthermore, in addition to these redundant p53-like functions, p63 and p73 possess an extended C-terminal region containing a sterile alpha motif (SAM) known to regulate development, which has no alternatively spliced counterpart in p53. This suggests that p63 and p73 play a separate, additional role in the regulation of normal development (Levrero 2000). Further research indicates that the relationship among the p53 family members is more complex still: some p63/p73 isoforms are p53-interfering, meaning that they not only lack p53-like functions but also act as dominant negatives against p53 activity (Yang 2002).
The intricate relationship among the p53 family members, with their overlapping but also opposing functions, has been the subject of extensive research and debate. Here we attempt to shed light on this question by investigating the amino acid sequence identity and structural homology of the p53 family members.
If the p53 family members all serve redundant functions in normal development and in regulating the expression of genes that control apoptosis in humans, then we would expect their amino acid sequence identity to be >50% and their DNA-binding domains to share structural homology.
To compare the amino acid sequences of p53, p63, and p73, a multiple sequence alignment was performed with the MUSCLE algorithm. The alignment highlights differences between the sequences, including substitutions and indels. Of the three multiple sequence alignment algorithms available in the msa package, MUSCLE was chosen because it works especially well with proteins, and because ClustalOmega is not suitable for aligning sequences with large internal indels. The amino acid sequences of p53 (UniProt ID: P04637), p63 (Q9H3D4), and p73 (O15350) were downloaded in FASTA format from the UniProt database. To visualize the alignment, the msaPrettyPrint() function was used because it can add a sequence logo, a graphical representation of amino acid conservation at each position. It produces a highly customizable alignment plot that assigns different colors to groups of amino acids to highlight differences between the sequences.
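To make the notion of sequence identity concrete, the following is a minimal base-R sketch of how percent identity can be computed between two rows of an alignment. The function name percent_identity and the toy sequences are hypothetical, and the convention used here (skipping every column that contains a gap) is an assumption; seqidentity() in bio3d may normalize differently, so this illustrates the idea rather than reproducing its output.

```r
# Percent identity between two rows of an alignment (toy sketch, base R only).
# Columns where either row has a gap ("-") are skipped; other conventions
# (for example dividing by the shorter sequence length) give different values.
percent_identity <- function(aln_a, aln_b) {
  a <- strsplit(aln_a, "")[[1]]
  b <- strsplit(aln_b, "")[[1]]
  stopifnot(length(a) == length(b))   # rows of one alignment have equal width
  both <- a != "-" & b != "-"         # columns with a residue in both rows
  sum(a[both] == b[both]) / sum(both)
}

# Hypothetical aligned fragments, not taken from the p53 family alignment
toy_a <- "MKT-LLVA"
toy_b <- "MRTQLL-A"
percent_identity(toy_a, toy_b)   # 5 matches over 6 gap-free columns
```

The choice of denominator matters most when one sequence is much shorter than the others, as is the case for p53 relative to p63 and p73.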
To compare the structures of the p53 family members, structural alignment and comparison were performed with the bio3d package. Here, we focus only on the DNA-binding domain (DBD), as the p53 family members bind to very similar DNA motifs. The PDB files were obtained from the RCSB PDB (Research Collaboratory for Structural Bioinformatics Protein Data Bank): p53 (accession number: 2FEJ), p63 (2RMN), and p73 (2XWC). The three structures were aligned and superposed using the pdbaln() function, their sequence identity was calculated using the seqidentity() function, and the RMSD was calculated as a measure of structural similarity. The NGLVieweR() function was then used to visualize the 3D structures; this shows the fold of each protein and whether parts of the domain are truncated or missing in one structure compared with the others.
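As a rough illustration of what the RMSD measures, the sketch below computes it in base R for two small coordinate sets that are assumed to be superposed already. The function name rmsd_coords and the coordinates are hypothetical; unlike rmsd(pdbs, fit = TRUE) in bio3d, this sketch performs no fitting step.

```r
# RMSD between two already-superposed coordinate sets (toy sketch, base R).
# Each matrix has one row per atom and columns x, y, z (in angstroms).
rmsd_coords <- function(xyz_a, xyz_b) {
  stopifnot(identical(dim(xyz_a), dim(xyz_b)))
  sq_dist <- rowSums((xyz_a - xyz_b)^2)  # squared distance per atom pair
  sqrt(mean(sq_dist))                    # root of the mean squared distance
}

# Hypothetical coordinates for three atoms
a <- matrix(c(0, 0, 0,
              1, 0, 0,
              0, 1, 0), ncol = 3, byrow = TRUE)
b <- a
b[, 3] <- b[, 3] + 2   # shift every atom by 2 angstroms along z
rmsd_coords(a, b)      # every atom moved by 2, so the RMSD is 2
```

Because the distances are squared before averaging, a few badly matched atoms (for example in flexible loops or termini) can raise the value considerably.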
# Uncomment commands to install the necessary packages
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install()
library(BiocManager)
Bioconductor version 3.14 (BiocManager 1.30.18), R 4.1.3 (2022-03-10)
Bioconductor version '3.14' is out-of-date; the current release version '3.15' is available with R
version '4.2'; see https://bioconductor.org/install
Attaching package: ‘BiocManager’
The following object is masked from ‘package:msa’:
version
# BiocManager::install("Biostrings")
library(Biostrings)
# install.packages("seqinr")
library(seqinr)
Attaching package: ‘seqinr’
The following object is masked from ‘package:Biostrings’:
translate
The following object is masked from ‘package:matrixStats’:
count
# BiocManager::install("msa")
library(msa)
# BiocManager::install("muscle")
library(muscle)
# install.packages("bio3d", dependencies=TRUE)
library(bio3d)
Attaching package: ‘bio3d’
The following objects are masked from ‘package:seqinr’:
consensus, read.fasta, write.fasta
The following object is masked from ‘package:Biostrings’:
mask
The following object is masked from ‘package:SummarizedExperiment’:
trim
The following object is masked from ‘package:GenomicRanges’:
trim
The following object is masked from ‘package:IRanges’:
trim
# install.packages("NGLVieweR")
# install.packages("remotes")
# remotes::install_github("nvelden/NGLVieweR")
library(NGLVieweR)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
# Amino acid sequences of p53, p63, and p73 proteins are downloaded from UniProtKB as fasta files, ensuring they are saved to the same directory as the R notebook
# Read fasta files using the function readAAStringSet from the Biostrings package and assign the appropriate fasta file to the variables "p53_seq", "p63_seq", and "p73_seq"
p53_seq <- readAAStringSet("p53.fasta")
p63_seq <- readAAStringSet("p63.fasta")
p73_seq <- readAAStringSet("p73.fasta")
# Print out each sequence to ensure the fasta file is read successfully
p53_seq
AAStringSet object of length 1:
width seq names
[1] 393 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLS...GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD sp|P04637|P53_HUM...
p63_seq
AAStringSet object of length 1:
width seq names
[1] 680 MNFETSRCATLQYCPDPYIQRFVETPAHFSWKE...QTISFPPRDEWNDFNFDMDARRNKQQRIKEEGE sp|Q9H3D4|P63_HUM...
p73_seq
AAStringSet object of length 1:
width seq names
[1] 636 MAQSTATSPDGGTTFEHLWSSLEPDSTYFDLPQ...GGPDEWADFGFDLPDCKARKQPIKEEFTEAEIH sp|O15350|P73_HUM...
# Combine the 3 amino acid sequences into a single AAStringSet and assign it to the variable "p53family" so that the msa function aligns all 3 sequences together
p53family <- c(p53_seq, p63_seq, p73_seq)
# Confirm the number of sequences in the vector
length(p53family)
[1] 3
# Run the msa function with MUSCLE algorithm on "p53family" and assign it to the variable msa for multiple sequence alignment
msa <- msaMuscle(p53family)
# Show the full length of the alignment
print(msa, show = "complete")
MsaAAMultipleAlignment with 3 rows and 704 columns
aln (1..74) names
[1] -------------MEEPQSDPSVEPP-----------------------LSQETFSDLWKLLPE--------NN sp|P04637|P53_HUM...
[2] MNFETSRCATLQYCPDPYIQRFVETPAHFSWKESYYRSTMSQSTQTNEFLSPEVFQHIWDFLEQPICSVQPIDL sp|Q9H3D4|P63_HUM...
[3] -----------------MAQSTATSP-----------------------DGGTTFEHLWSSLEP--------DS sp|O15350|P73_HUM...
Con -------------???P??Q??VE?P-----------------------LS?ETF?HLW??LE?--------D? Consensus
aln (75..148) names
[1] VLSPLPSQ-------------AMDDLMLSPDDIEQ--W--FT------------EDPGPDEAPRMPEAAPPVAP sp|P04637|P53_HUM...
[2] NFVDEPSEDGATNKI----EISMDCIRMQDSDLSDPMWPQYTNLGLLNSMDQQIQNGSSSTSPYNTDHAQNSVT sp|Q9H3D4|P63_HUM...
[3] TYFDLPQSSRGNNEVVGGTDSSMDVFHLEGMTTSV-----MAQFNLLSSTMDQMSSRAASASPYTPEHAASVPT sp|O15350|P73_HUM...
Con ???DLPS?????N??----??SMD???L???D?S?--W--?T???LL?S???Q??????S?SPY?PEHA??V?T Consensus
aln (149..222) names
[1] APAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDS sp|P04637|P53_HUM...
[2] AP-SPYAQPSSTFDALSPSPAIPSNTDYPGPHSFDVSFQQSSTAKSATWTYSTELKKLYCQIAKTCPIQIKVMT sp|Q9H3D4|P63_HUM...
[3] H--SPYAQPSSTFDTMSPAPVIPSNTDYPGPHHFEVTFQQSSTAKSATWTYSPLLKKLYCQIAKTCPIQIKVST sp|O15350|P73_HUM...
Con AP-SPYAQPSSTFD??SPSP?IPSNTDYPGPH?F?V?FQQSSTAKSATWTYSP?LKKLYCQIAKTCPIQIKV?T Consensus
aln (223..296) names
[1] TPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSD-SDG-LAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYE sp|P04637|P53_HUM...
[2] PPPQGAVIRAMPVYKKAEHVTEVVKRCPNHELSREFNEGQIAPPSHLIRVEGNSHAQYVEDPITGRQSVLVPYE sp|Q9H3D4|P63_HUM...
[3] PPPPGTAIRAMPVYKKAEHVTDVVKRCPNHELGRDFNEGQSAPASHLIRVEGNNLSQYVDDPVTGRQSVVVPYE sp|O15350|P73_HUM...
Con PPPPGT?IRAMPVYKKAEHVTEVVKRCPNHEL?RDFNEGQ?APPSHLIRVEGN???QYVDDP?TGRQSVVVPYE Consensus
aln (297..370) names
[1] PPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGE- sp|P04637|P53_HUM...
[2] PPQVGTEFTTVLYNFMCNSSCVGGMNRRPILIIVTLETRDGQVLGRRCFEARICACPGRDRKADEDSIRKQQV- sp|Q9H3D4|P63_HUM...
[3] PPQVGTEFTTILYNFMCNSSCVGGMNRRPILIIITLEMRDGQVLGRRSFEGRICACPGRDRKADEDHYREQQAL sp|O15350|P73_HUM...
Con PPQVGTEFTTILYNFMCNSSCVGGMNRRPILIIITLE?RDGQVLGRRSFE?RICACPGRDRKADED??RKQQ?- Consensus
aln (371..444) names
[1] -PHHELPPGSTKRALPNNTSSSPQ-----PKKKPLDGEYFTLQIRGRERFEMFRELNEALELKD---------- sp|P04637|P53_HUM...
[2] -SDSTKNGDGTKRPFRQNTHGIQM--TSIKKRRSPDDELLYLPVRGRETYEMLLKIKESLELMQYLPQHTIETY sp|Q9H3D4|P63_HUM...
[3] NESSAKNGAASKRAFKQSPPAVPALGAGVKKRRHGDEDTYYLQVRGRENFEILMKLKESLELMELVPQPLVDSY sp|O15350|P73_HUM...
Con -??S?KNG??TKRAF?QNT???P?--???KKRR??D?E??YLQVRGRE?FEML?KLKESLELM???PQ?????Y Consensus
aln (445..518) names
[1] -------------------------------------------------------------------------- sp|P04637|P53_HUM...
[2] RQQQQQQHQHLLQKQTSIQSPSSYGNSSPPLNKMN-SMNKLPSVSQLIN--PQQRNALTPTTIPDGMGANIPMM sp|Q9H3D4|P63_HUM...
[3] RQQQQ-----LLQRPSHLQ-PPSYGPVLSPMNKVHGGMNKLPSVNQLVGQPPPHSSAATPNLGPVGPGM-LNNH sp|O15350|P73_HUM...
Con RQQQQ-----LLQ?????Q-P?SYG????P?NK??-?MNKLPSV?QL??--P????A?TP???P?G?G?-???? Consensus
aln (519..592) names
[1] ----AQAGKEPGGSRAH--------------------------------------------------------- sp|P04637|P53_HUM...
[2] GTHMPMAGDMNGLSPTQALPPPLSMPSTSHCTPPPPYPTDCSIVSFLARLGCSSCLDYFTTQGLTTIYQIEHYS sp|Q9H3D4|P63_HUM...
[3] GHAVPANGEMSSSHSAQ------SMVSGSHCTPPPPYHADPSLVSFLTGLGCPNCIEYFTSQGLQSIYHLQNLT sp|O15350|P73_HUM...
Con G???P?AG?M?G?S?AQ------SM?S?SHCTPPPPY??D?S?VSFL??LGC??C??YFT?QGL??IY?????? Consensus
aln (593..666) names
[1] ---------------------------------SSHLKSKKGQSTS---------------------------- sp|P04637|P53_HUM...
[2] MDDLASLKIPEQFRHAIWKGILDHRQLHEFSSPSHLLRTPSSASTVSVGSSETRGERVIDAVRFTLRQTISFPP sp|Q9H3D4|P63_HUM...
[3] IEDLGALKIPEQYRMTIWRGLQDLKQGHDYSTAQQLLRSSNAATISIGGSGELQRQRVMEAVHFRVRHTITIPN sp|O15350|P73_HUM...
Con ??DL??LKIPEQ?R??IW?G??D??Q?H??S??S?LLRS???ASTS??GS?E????RV??AV?F??R?TI??P? Consensus
aln (667..704) names
[1] ---------------------RHKKLMFKTEGPDSD-- sp|P04637|P53_HUM...
[2] R-------DEWNDFNFDMDARRNKQQRIKEEGE----- sp|Q9H3D4|P63_HUM...
[3] RGGPGGGPDEWADFGFDLPDCKARKQPIKEEFTEAEIH sp|O15350|P73_HUM...
Con R-------DEW?DF?FD????R?KKQ?IKEEG????-- Consensus
# Read pdb files of the DNA binding domains of p53 (2FEJ), p63 (2RMN), and p73 (2XWC) using the read.pdb() function from the bio3d package by providing the file paths (or PDB accession numbers) and assign them to the variables "p53_pdb", "p63_pdb", and "p73_pdb"
p53_pdb <- read.pdb("2fej.pdb")
p63_pdb <- read.pdb("2rmn.pdb")
p73_pdb <- read.pdb("2xwc.pdb")
PDB has ALT records, taking A only, rm.alt=TRUE
# Create a multiple sequence alignment of the PDB files using the pdbaln() function and assign it to the variable "pdbs"
pdbs <- pdbaln(c("2fej.pdb", "2rmn.pdb", "2xwc.pdb"), fit=TRUE, web.args = list(email = "akristin@ucsd.edu"))
Reading PDB files:
2fej.pdb
2rmn.pdb
2xwc.pdb
.. PDB has ALT records, taking A only, rm.alt=TRUE
.
Extracting sequences
Warning in system(paste(exefile, ver), ignore.stderr = TRUE, ignore.stdout = TRUE) :
error in running command
Will try to align sequences online...
Job successfully submited (job ID: muscle-R20220605-050739-0407-59238102-p2m)
Waiting for job to finish...Done.
pdb/seq: 1 name: 2fej.pdb
pdb/seq: 2 name: 2rmn.pdb
pdb/seq: 3 name: 2xwc.pdb
PDB has ALT records, taking A only, rm.alt=TRUE
# Print "pdbs" to view the alignment
pdbs
1 . . . . . . 70
2fej.pdb ----------SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPP
2rmn.pdb GSSTFDALSPSPAIPSNTDYPGPHSFDVSFQQSSTAKSATWTYSTELKKLYCQIAKTCPIQIKVMTPPPQ
2xwc.pdb ----------APVIPSNTDYPGPHHFEVTFQQSSTAKSATWTYSPLLKKLYCQIAKTCPIQIKVSTPPPP
^**^ * * * ^ * * **** * *** * *^^**^*****^*^ * ^ **
1 . . . . . . 70
71 . . . . . . 140
2fej.pdb GTRVRAMAIYKQSQHMTEVVRRCPHHERCSD-SDG-LAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYE
2rmn.pdb GAVIRAMPVYKKAEHVTEVVKRCPNHELSREFNEGQIAPPSHLIRVEGNSHAQYVEDPITGRQSVLVPYE
2xwc.pdb GTAIRAMPVYKKAEHVTDVVKRCPNHELGRDFNEGQSAPASHLIRVEGNNLSQYVDDPVTGRQSVVVPYE
* ^*** ^** *^*^**^*** ** ^ ^* ** ******** *^^* * * **^****
71 . . . . . . 140
141 . . . . . . 210
2fej.pdb PPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRK
2rmn.pdb PPQVGTEFTTVLYNFMCNSSCVGGMNRRPILIIVTLETRDGQVLGRRCFEARICACPGRDRKADEDSIRK
2xwc.pdb PPQVGTEFTTILYNFMCNSSCVG---RRPILIIITLEMRDGQVLGRRSFEGRICACPGRDRKADEDHYRE
** **^^ **^ **^******^* ***** *^*** *^^*** ** *^********^ ^*^ *
141 . . . . . . 210
211 . . 233
2fej.pdb K--------GEPHH---------
2rmn.pdb QQVSDSTKNGDAFRQNTHGIQMT
2xwc.pdb A--------ENLYFQ--------
211 . . 233
Call:
pdbaln(files = c("2fej.pdb", "2rmn.pdb", "2xwc.pdb"), fit = TRUE,
web.args = list(email = "akristin@ucsd.edu"))
Class:
pdbs, fasta
Alignment dimensions:
3 sequence rows; 233 position columns (201 non-gap, 32 gap)
+ attr: xyz, resno, b, chain, id, ali, resid, sse, call
# Calculate percent sequence identity and assign the resulting matrix array to the variable "seqid"
seqid <- seqidentity(pdbs)
# Rename column and row names for easy interpretation
colnames(seqid) <- c("p53", "p63", "p73")
rownames(seqid) <- c("p53", "p63", "p73")
# Print renamed matrix
seqid
p53 p63 p73
p53 1.000 0.554 0.567
p63 0.554 1.000 0.824
p73 0.567 0.824 1.000
# Create a function to interpret sequence identity values: it prints "homolog" if the value is > 0.5 (50%) and "non-homolog" if the value is <= 0.5, then returns the value
interpret <- function(seqid){
if (seqid > 0.5) {
print ("homolog")
return(seqid)
} else {
print ("non-homolog")
return (seqid)
}
}
# Create variables for each pairing of the 3 proteins
p53andp63 <- seqid[2,1]
p53andp73 <- seqid[3,1]
p63andp73 <- seqid[2,3]
# Use the created function to interpret sequence identity values
interpret(p53andp63)
[1] "homolog"
[1] 0.554
interpret(p53andp73)
[1] "homolog"
[1] 0.567
interpret(p63andp73)
[1] "homolog"
[1] 0.824
# Calculate RMSD to measure structural similarity
rmsd <- rmsd(pdbs, fit = TRUE)
Warning in rmsd(pdbs, fit = TRUE) :
No indices provided, using the 201 non NA positions
# Rename column and row names for easy interpretation
colnames(rmsd) <- c("p53", "p63", "p73")
rownames(rmsd) <- c("p53", "p63", "p73")
# Print renamed matrix
rmsd
p53 p63 p73
p53 0.000 4.980 2.979
p63 4.980 0.000 4.004
p73 2.979 4.004 0.000
msaPrettyPrint(msa, output="tex", showNames="left", showLogo="top", logoColors="rasmol", shadingMode="functional", shadingModeArg="structure", showLegend=FALSE, askForOverwrite=FALSE)
knitr::include_graphics("msaSequenceLogos1.png")
knitr::include_graphics("msaSequenceLogos2.png")
knitr::include_graphics("msaSequenceLogos3.png")
## Source: https://github.com/nvelden/NGLVieweR
## Source: https://cran.r-project.org/web/packages/NGLVieweR/vignettes/NGLVieweR.html
NGLVieweR("2FEJ") %>%
addRepresentation("cartoon")
NGLVieweR("2RMN") %>%
addRepresentation("cartoon")
NGLVieweR("2XWC") %>%
addRepresentation("cartoon")
A multiple sequence alignment of the p53, p63, and p73 protein sequences was performed using the msa function with the MUSCLE algorithm. A dash (-) in one sequence marks a gap: a position where that sequence has no residue aligned to residues in the other sequences, because of either an indel or a shorter/truncated sequence. As shown in the alignment, p63 and p73 are more similar to each other in length (680 and 636 amino acid residues) than to p53 (only 393 residues). The alignment shows that p53 has a shorter/truncated C-terminal region than p63 and p73, as the p53 row consists mostly of dashes towards the end of the alignment (approximately positions 434-704). This supports previous findings that an extended C-terminal region containing a sterile alpha motif (SAM), known to regulate development, is alternatively spliced in p63 and p73 but not in p53, which suggests that p63 and p73 play an additional, separate role in regulating normal development (Levrero 2000). Even so, the majority of the DNA-binding domain residues (usually defined as residues 94-292 in p53) are conserved across the 3 proteins, as is evident from the sequence logo.
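Gapped stretches such as the one at the end of the p53 row can also be located programmatically instead of by eye. The following is a minimal base-R sketch using run-length encoding; the function name gap_runs and the toy row are hypothetical and are not taken from the alignment above.

```r
# Locate runs of gaps in one row of an alignment (toy sketch, base R).
gap_runs <- function(aln_row) {
  chars  <- strsplit(aln_row, "")[[1]]
  runs   <- rle(chars == "-")            # alternating residue/gap runs
  ends   <- cumsum(runs$lengths)         # last column of each run
  starts <- ends - runs$lengths + 1      # first column of each run
  # Keep only the runs that consist of gaps
  data.frame(start = starts, end = ends, length = runs$lengths)[runs$values, ]
}

# Hypothetical aligned row with a short internal gap and a longer one
toy_row <- "MEE--PQSD-----K"
gap_runs(toy_row)
```

Applied to a real alignment row converted to a character string, the same idea would report the start, end, and length of each gapped region.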
Following this observation, the DNA-binding domains of p53, p63, and p73 were analyzed by performing a multiple sequence alignment on the PDB files of the DNA-binding domain structures of the 3 proteins. This alignment contains far fewer dashes, meaning that there are few indels in the DBD and that its residues are mostly conserved across the 3 proteins. From the resulting PDB alignment, sequence identity was calculated using the seqidentity() function, which returns a matrix of pairwise sequence identity values. According to the interpret() function, all 3 proteins are labeled “homologs” of each other, as every pairwise sequence identity is greater than 50%. The sequence identity is 55.4% between p53 and p63, 56.7% between p53 and p73, and 82.4% between p63 and p73. Consistent with our previous observations from the alignments, p63 and p73 are more similar to each other than either is to p53. The identity of p53 to p63 is almost the same as its identity to p73, with p73 being slightly more similar to p53.
Finally, root-mean-square deviation (RMSD) values were calculated using the rmsd() function. The RMSD is the root-mean-square of the distances between corresponding atoms of two superposed structures, in Å: the smaller the RMSD, the more similar the two structures, and the RMSD between identical structures is 0. The RMSD matrix shows a value of 4.980 between p53 and p63, 2.979 between p53 and p73, and 4.004 between p63 and p73. Generally, a value of < 2 Å is considered structurally very similar. Based on these values, p53 and p73 are the most structurally similar pair, followed by p63 and p73, and finally p53 and p63, although all three values are above 2 Å. By this measure, despite their high sequence identities, the 3 proteins are structurally distinct in their DBDs. Note that the RMSD was calculated over all 201 non-gap alignment positions, so poorly matched regions contribute to these values.
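The pairwise comparisons described here can be collected into a single table. The sketch below is one possible way to do so in base R; the matrices are re-entered by hand from the seqid and rmsd output printed earlier, and the object and column names (seqid_m, rmsd_m, pairs, identity_over_50pct, rmsd_under_2A) are hypothetical.

```r
# Values copied from the seqid and rmsd matrices printed earlier
ids <- c("p53", "p63", "p73")
seqid_m <- matrix(c(1.000, 0.554, 0.567,
                    0.554, 1.000, 0.824,
                    0.567, 0.824, 1.000),
                  nrow = 3, byrow = TRUE, dimnames = list(ids, ids))
rmsd_m <- matrix(c(0.000, 4.980, 2.979,
                   4.980, 0.000, 4.004,
                   2.979, 4.004, 0.000),
                 nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# One row per unordered pair (upper triangle of the symmetric matrices)
idx <- which(upper.tri(seqid_m), arr.ind = TRUE)
pairs <- data.frame(
  pair     = paste(ids[idx[, "row"]], ids[idx[, "col"]], sep = "-"),
  identity = seqid_m[idx],
  rmsd     = rmsd_m[idx]
)
pairs$identity_over_50pct <- pairs$identity > 0.5   # threshold used by interpret()
pairs$rmsd_under_2A       <- pairs$rmsd < 2         # "very similar" rule of thumb
pairs
```

Laid out this way, it is easy to see that every pair passes the 50% identity threshold while none falls under the 2 Å RMSD rule of thumb.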
Visualization of the structures using NGLVieweR is consistent with this: while all 3 structures appear distinct, p53 and p73 look the most similar, as both are roughly spherical in shape, whereas p63 appears to have two “domains” connected by a bridge.
Based on these analyses, the hypothesis was partially supported: the sequence identities between the DNA-binding domains of the 3 proteins are all >50%, but their structures are less similar than expected. However, it is still unclear whether the p53 family of proteins serve the same or distinct functions. Consistent with what is known today, the p53 family has some overlapping functions but also distinct ones, which may be explained by sequences that are similar, especially in the DNA-binding domain, but structures that are less so.